Perf: Optimize in memory sort #15380
Conversation
let mut current_batches = Vec::new();
let mut current_size = 0;

for batch in std::mem::take(&mut self.in_mem_batches) {
I think it would be nice to use pop (while let Some(batch) = v.pop()) here to remove each batch from the vec once it is sorted, to reduce memory usage. Right now, AFAIK, the batches are retained until after the loop.
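A minimal sketch of the suggestion, generic over the batch type, with a hypothetical sort_one callback standing in for the real per-batch sort:

fn drain_and_sort<T>(mut in_mem_batches: Vec<T>, mut sort_one: impl FnMut(T)) {
    // Popping moves each batch out of the vec, so it is dropped as soon
    // as it has been processed, instead of the whole vec staying alive
    // until the loop finishes.
    while let Some(batch) = in_mem_batches.pop() {
        sort_one(batch); // `batch` is freed here, lowering peak memory
    }
}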
Thank you @Dandandan for the review and the good suggestion; addressed it!
I think this is already looking quite nice. What do you need to finalize this, @zhuqi-lucas?
Thank you @Dandandan for the review. I think the next step is just to add the benchmark results for this PR. It's mergeable as a first version; later we can improve it according to the comments:
@alamb Do we have the CI benchmark running now? If not, I need your help to run it... Thanks a lot! As for sort-tpch itself, I ran it for the improvement result, but I haven't run the other benchmarks.

Previous sort-tpch:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 2241.04ms │ 1816.69ms │ +1.23x faster │
│ Q2 │ 1841.01ms │ 1496.73ms │ +1.23x faster │
│ Q3 │ 12755.85ms │ 12770.18ms │ no change │
│ Q4 │ 4433.49ms │ 3278.70ms │ +1.35x faster │
│ Q5 │ 4414.15ms │ 4409.04ms │ no change │
│ Q6 │ 4543.09ms │ 4597.32ms │ no change │
│ Q7 │ 8012.85ms │ 9026.30ms │ 1.13x slower │
│ Q8 │ 6572.37ms │ 6049.51ms │ +1.09x faster │
│ Q9 │ 6734.63ms │ 6345.69ms │ +1.06x faster │
│ Q10 │ 9896.16ms │ 9564.17ms │ no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 61444.64ms │
│ Total Time (concat_batches_for_sort) │ 59354.33ms │
│ Average Time (main) │ 6144.46ms │
│ Average Time (concat_batches_for_sort) │ 5935.43ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 4 │
└────────────────────────────────────────┴────────────┘
Latest result based on the current latest code:

--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 153.49ms │ 137.57ms │ +1.12x faster │
│ Q2 │ 131.29ms │ 120.93ms │ +1.09x faster │
│ Q3 │ 980.57ms │ 982.22ms │ no change │
│ Q4 │ 252.25ms │ 245.09ms │ no change │
│ Q5 │ 464.81ms │ 449.27ms │ no change │
│ Q6 │ 481.44ms │ 455.45ms │ +1.06x faster │
│ Q7 │ 810.73ms │ 709.74ms │ +1.14x faster │
│ Q8 │ 498.10ms │ 491.12ms │ no change │
│ Q9 │ 503.80ms │ 510.20ms │ no change │
│ Q10 │ 789.02ms │ 706.45ms │ +1.12x faster │
│ Q11 │ 417.39ms │ 411.50ms │ no change │
└──────────────┴──────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main) │ 5482.89ms │
│ Total Time (concat_batches_for_sort) │ 5219.53ms │
│ Average Time (main) │ 498.44ms │
│ Average Time (concat_batches_for_sort) │ 474.50ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 6 │
└────────────────────────────────────────┴───────────┘
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 2243.52ms │ 1825.64ms │ +1.23x faster │
│ Q2 │ 1842.11ms │ 1639.00ms │ +1.12x faster │
│ Q3 │ 12446.31ms │ 11981.63ms │ no change │
│ Q4 │ 4047.55ms │ 3715.96ms │ +1.09x faster │
│ Q5 │ 4364.46ms │ 4503.51ms │ no change │
│ Q6 │ 4561.01ms │ 4688.31ms │ no change │
│ Q7 │ 8158.01ms │ 7915.54ms │ no change │
│ Q8 │ 6077.40ms │ 5524.08ms │ +1.10x faster │
│ Q9 │ 6347.21ms │ 5853.44ms │ +1.08x faster │
│ Q10 │ 11561.03ms │ 14235.69ms │ 1.23x slower │
│ Q11 │ 6069.42ms │ 5666.77ms │ +1.07x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 67718.04ms │
│ Total Time (concat_batches_for_sort) │ 67549.58ms │
│ Average Time (main) │ 6156.19ms │
│ Average Time (concat_batches_for_sort) │ 6140.87ms │
│ Queries Faster │ 6 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 4 │
└────────────────────────────────────────┴────────────┘
Thanks for sharing the results @zhuqi-lucas, this is really interesting! I think it mainly shows that we should probably try to use more efficient in-memory sorting here (e.g. an arrow kernel that sorts multiple batches) rather than use SortPreservingMergeStream.
I think the SortPreservingMergeStream is about as efficient as we know how to make it. Maybe we can look into what overhead makes concat'ing better 🤔 Any per-stream overhead we can improve in SortPreservingMergeStream would likely flow directly to any query that does sorts.
Hm, that doesn't make much sense, as …
Hm 🤔 ... but that will still take a separate step of sorting the input batches, which, next to the sorting itself, involves a full extra copy using take. I think the most efficient way would be to sort the indices into the arrays in one step, followed by interleave.
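For reference, a hedged sketch of that two-step per-batch path, assuming arrow's sort_to_indices and take kernels on a single column, to show where the extra copy happens:

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::{sort_to_indices, take};
use arrow::error::Result;

// Sorting one batch's column: compute the sort permutation, then `take`
// materializes it; that materialization is the full extra copy.
fn sort_one_column(col: &Int32Array) -> Result<ArrayRef> {
    let indices = sort_to_indices(col, None, None)?;
    take(col, &indices, None)
}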
It seems that when we merge the sorted batches, we are already using interleave to merge via the sorted indices. Here is the code:

/// Drains the in_progress row indexes, and builds a new RecordBatch from them
///
/// Will then drop any batches for which all rows have been yielded to the output
///
/// Returns `None` if no pending rows
pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> {
if self.is_empty() {
return Ok(None);
}
let columns = (0..self.schema.fields.len())
.map(|column_idx| {
let arrays: Vec<_> = self
.batches
.iter()
.map(|(_, batch)| batch.column(column_idx).as_ref())
.collect();
Ok(interleave(&arrays, &self.indices)?)
})
.collect::<Result<Vec<_>>>()?;
self.indices.clear();

But in this PR we also concat some batches into one batch. Do you mean we can also use the indices from each batch into one combined batch, just like in the merge phase?
Thanks @alamb for triggering this; it seems stuck.
I mean, theoretically we don't have to merge at all. The merging is useful for sorting streams of data, but I think it is expected that sorting batches first, followed by a custom merge implementation, is slower than a single sorting pass based on Rust std's unstable sort (which is optimized for doing a minimal number of comparisons quickly).
A more complete rationale/explanation of the same idea was written by @2010YOUY01 here: #15375 (comment)
|
I think I got it now, thank you @Dandandan. It means we already have those in-memory batches; we just need to first sort all elements' indices (a 2-level index consisting of (batch_idx, row_idx)). We don't need to construct the StreamingMergeBuilder for the in-memory sort; we can sort in a single pass. Let me try this way and compare the performance!
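A minimal sketch of that single-pass idea, assuming a single Int32 sort key for simplicity (the real code must handle arbitrary sort expressions), using arrow's interleave kernel:

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::interleave;
use arrow::error::Result;

// Sort all rows across all batches in one pass over a 2-level
// (batch_idx, row_idx) index, then materialize once with `interleave`.
fn sort_across_batches(batches: &[Int32Array]) -> Result<ArrayRef> {
    let mut indices: Vec<(usize, usize)> = batches
        .iter()
        .enumerate()
        .flat_map(|(b, arr)| (0..arr.len()).map(move |r| (b, r)))
        .collect();

    // A single unstable sort over every row; no per-batch sort + merge.
    indices.sort_unstable_by_key(|&(b, r)| batches[b].value(r));

    let refs: Vec<&dyn Array> = batches.iter().map(|a| a as &dyn Array).collect();
    interleave(&refs, &indices)
}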
Very interesting. As a first try, I merged all in-memory batches and did a single sort; some queries become crazy fast and some crazy slow. I think that is because: …

So as a next step, can we try to make the in-memory sort parallel?

--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 2243.52ms │ 1416.52ms │ +1.58x faster │
│ Q2 │ 1842.11ms │ 1096.12ms │ +1.68x faster │
│ Q3 │ 12446.31ms │ 12535.45ms │ no change │
│ Q4 │ 4047.55ms │ 1964.73ms │ +2.06x faster │
│ Q5 │ 4364.46ms │ 5955.70ms │ 1.36x slower │
│ Q6 │ 4561.01ms │ 6275.39ms │ 1.38x slower │
│ Q7 │ 8158.01ms │ 19145.68ms │ 2.35x slower │
│ Q8 │ 6077.40ms │ 5146.80ms │ +1.18x faster │
│ Q9 │ 6347.21ms │ 5544.48ms │ +1.14x faster │
│ Q10 │ 11561.03ms │ 23572.68ms │ 2.04x slower │
│ Q11 │ 6069.42ms │ 4810.88ms │ +1.26x faster │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 67718.04ms │
│ Total Time (concat_batches_for_sort) │ 87464.44ms │
│ Average Time (main) │ 6156.19ms │
│ Average Time (concat_batches_for_sort) │ 7951.31ms │
│ Queries Faster │ 6 │
│ Queries Slower │ 4 │
│ Queries with No Change │ 1 │
└────────────────────────────────────────┴────────────┘

Patch tried:

diff --git a/datafusion/physical-plan/src/sorts/sort.rs b/datafusion/physical-plan/src/sorts/sort.rs
index 7fd1c2b16..ec3cd89f3 100644
--- a/datafusion/physical-plan/src/sorts/sort.rs
+++ b/datafusion/physical-plan/src/sorts/sort.rs
@@ -671,85 +671,14 @@ impl ExternalSorter {
return self.sort_batch_stream(batch, metrics, reservation);
}
- // If less than sort_in_place_threshold_bytes, concatenate and sort in place
- if self.reservation.size() < self.sort_in_place_threshold_bytes {
- // Concatenate memory batches together and sort
- let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
- self.in_mem_batches.clear();
- self.reservation
- .try_resize(get_reserved_byte_for_record_batch(&batch))?;
- let reservation = self.reservation.take();
- return self.sort_batch_stream(batch, metrics, reservation);
- }
-
- let mut merged_batches = Vec::new();
- let mut current_batches = Vec::new();
- let mut current_size = 0;
-
- // Drain in_mem_batches using pop() to release memory earlier.
- // This avoids holding onto the entire vector during iteration.
- // Note:
- // Now we use `sort_in_place_threshold_bytes` to determine, in future we can make it more dynamic.
- while let Some(batch) = self.in_mem_batches.pop() {
- let batch_size = get_reserved_byte_for_record_batch(&batch);
-
- // If adding this batch would exceed the memory threshold, merge current_batches.
- if current_size + batch_size > self.sort_in_place_threshold_bytes
- && !current_batches.is_empty()
- {
- // Merge accumulated batches into one.
- let merged = concat_batches(&self.schema, &current_batches)?;
- current_batches.clear();
-
- // Update memory reservation.
- self.reservation.try_shrink(current_size)?;
- let merged_size = get_reserved_byte_for_record_batch(&merged);
- self.reservation.try_grow(merged_size)?;
-
- merged_batches.push(merged);
- current_size = 0;
- }
-
- current_batches.push(batch);
- current_size += batch_size;
- }
-
- // Merge any remaining batches after the loop.
- if !current_batches.is_empty() {
- let merged = concat_batches(&self.schema, &current_batches)?;
- self.reservation.try_shrink(current_size)?;
- let merged_size = get_reserved_byte_for_record_batch(&merged);
- self.reservation.try_grow(merged_size)?;
- merged_batches.push(merged);
- }
-
- // Create sorted streams directly without using spawn_buffered.
- // This allows for sorting to happen inline and enables earlier batch drop.
- let streams = merged_batches
- .into_iter()
- .map(|batch| {
- let metrics = self.metrics.baseline.intermediate();
- let reservation = self
- .reservation
- .split(get_reserved_byte_for_record_batch(&batch));
-
- // Sort the batch inline.
- let input = self.sort_batch_stream(batch, metrics, reservation)?;
- Ok(input)
- })
- .collect::<Result<_>>()?;
-
- let expressions: LexOrdering = self.expr.iter().cloned().collect();
-
- StreamingMergeBuilder::new()
- .with_streams(streams)
- .with_schema(Arc::clone(&self.schema))
- .with_expressions(expressions.as_ref())
- .with_metrics(metrics)
- .with_batch_size(self.batch_size)
- .with_fetch(None)
- .with_reservation(self.merge_reservation.new_empty())
- .build()
+ // Because batches are all in memory, we can sort them in place
+ // Concatenate memory batches together and sort
+ let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
+ self.in_mem_batches.clear();
+ self.reservation
+ .try_resize(get_reserved_byte_for_record_batch(&batch))?;
+ let reservation = self.reservation.take();
+ self.sort_batch_stream(batch, metrics, reservation)
}
I think for … the core improvements that I think are important are: …
Good explanation.
I see, execute already takes the partition:

fn execute(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
In this case, the final merging might become the bottleneck, because SPM does not have internal parallelism either; during the final merge only one core is busy.
Yes, to be clear, I am not arguing for removing SortPreservingMergeExec or sorting in two phases altogether or anything similar; I was just reacting to the idea of adding more parallelism in …
Thank you @2010YOUY01 @Dandandan, it's very interesting. I am thinking:

final_merged_batch_size =
    if partition_cal_size < min_sort_size => min_sort_size
    else if partition_cal_size > max_sort_size => max_sort_size
    else => partition_cal_size

This prevents creating too many small batches (which can fragment merge tasks) or overly large ones. But how can we calculate min_sort_size and max_sort_size?
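In Rust this heuristic is just an integer clamp; the names (partition_cal_size, min_sort_size, max_sort_size) are the hypothetical knobs from the comment above, not existing DataFusion settings:

// Clamp the per-partition size into [min_sort_size, max_sort_size], so we
// create neither too many small merged batches nor an overly large one.
fn final_merged_batch_size(
    partition_cal_size: usize,
    min_sort_size: usize,
    max_sort_size: usize,
) -> usize {
    partition_cal_size.clamp(min_sort_size, max_sort_size)
}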
Yeah, sorry, I had a bug; retriggered.
I wonder if we can skip interleave / copying entirely? Specifically, what if we sorted to indices, as you suggested, but then instead of calling interleave …
🤖: Benchmark completed
Thanks @alamb, it looks promising.
No performance improvement in this benchmark; I believe the benchmark batch size is mostly greater than the sort_in_place threshold, so it will not gain from this PR. sort-tpch 10 should gain performance, but it is not in this benchmark list.
It may be that the tests are unstable, or that the memory / code generated for x86 is different than for aarch64. I'll run the benchmarks again to see if they are reproducible.
🤖: Benchmark completed
🤖: Benchmark completed
I was wondering, maybe we should only add the "concat-arrays-only-instead-of-full-record-batch" optimization and leave the rest as it is (for now)? So don't change the … It might be of smaller significance, but it is more likely to have no or smaller regressions (and we can follow up with better heuristics that work across different machines).
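A hedged sketch of the "concat arrays only" idea, assuming it means concatenating just the sort-key column gathered from each in-memory batch with arrow's concat kernel, instead of concatenating the full RecordBatches:

use arrow::array::{Array, ArrayRef};
use arrow::compute::concat;
use arrow::error::Result;

// Concatenate one sort-key column taken from every in-memory batch;
// the payload columns stay untouched until the sort order is known.
fn concat_sort_key(key_columns: &[ArrayRef]) -> Result<ArrayRef> {
    let refs: Vec<&dyn Array> = key_columns.iter().map(|a| a.as_ref()).collect();
    concat(&refs)
}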
Thank you @alamb, the result still shows a regression for the Linux run.
Good point @Dandandan, let me do the smallest optimization first; I will address it.
Thank you @Dandandan @alamb, …
🤖: Benchmark completed
🤖: Benchmark completed
// Also, we only support sort expressions with less than 3 columns for now. Because from testing, when
// columns > 3, the performance of in-place sort is worse than sort/merge.
// Need to further investigate the performance of in-place sort when columns > 3.
if self.expr.len() <= 2
Shouldn't we remove this self.expr.len() <= 2 here? This wasn't here before.
Thank you @Dandandan for the review; there was a regression when testing the >2-column cases.
Addressed in the latest PR; let's see the result for the smallest changes. Thanks.
Thank you @alamb, sorry that we need to trigger the benchmark again for the latest changes, to see the result of the smallest changes.
Looking at the earlier result …
This is the query …
So it might actually be the case that the changed code is a bit slower for this case. In this query there is only a little data to copy (so concat batches -> concat sort keys doesn't help that much), while maybe the overhead of using interleave_batches is higher 🤔
🤖: Benchmark completed
🤖: Benchmark completed
Which issue does this PR close?
Rationale for this change
Perf: Automatically concat_batches for sort, which improves performance.
It's mergeable as a first version; later we can improve it according to the comments:
#15375 (comment)
What changes are included in this PR?
Perf: Automatically concat_batches for sort, which improves performance.
Are these changes tested?
Yes
Are there any user-facing changes?
No